-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Replace pre publication failed to commit cluster state exceptions #135706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace pre publication failed to commit cluster state exceptions #135706
Conversation
Changes a FailedToCommitClusterStateException incorrectly thrown prior to cluster state update publication to a NotMasterException
- Introduces a FailedToPublishClusterStateException. - Changes three FailedToCommitClusterStateExceptions thrown prior to the cluster state update publication to FailedToPublishClusterStateExceptions. Relates to: ES-13061
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the third part of a series of PRs fixing how the FailedToCommitClusterStateException
is used in ElasticSearch. As per #135017, FailedToCommitClusterStateException
is defined as:
Thrown when a cluster state publication fails to commit the new cluster state. If publication fails then a new master is elected but the
update might or might not take effect, depending on whether the newly-elected master accepted the published state that failed to
be committed. This exception should only be used when there is <i>ambiguity</i> whether a state update took effect or not.
Currently, FailedToCommitClusterStateException
is used as a 'catch-all' exception thrown at multiple places throughout the Coordinator
and MasterService
during the publication process. Semantically however, it doesn't make sense to throw this exception before the cluster state update is actually sent over the wire, since at this point, we know for certain that the cluster state update failed. FailedToCommitClusterStateException
is intended to display ambiguity.
This work is a pre-requisite to #134213.
Changes
I replace three FailedToCommitClusterStateExceptions
thrown prior to publishing the cluster state with NotMasterExceptions
as per this conversation with David Turner.
Next Steps
The goal of this work is to fix up all erroneously used FailedToCommitClusterStateException
.
Done:
- Update exception messages: #135017
- Update the
FailedToCommitClusterStateException
thrown insideMasterService.Batch.onResponse()
when draining the queue after the threadpool has shut down - #135008 - Change a
FailedToCommitClusterStateException
toNotMasterException
during the pre-publication process: #135548
Todo:
- Replace the
FailedToCommitClusterStateException
exception insideMasterService.BatchingTaskQueue.submitTask
, (here) with aNotMasterException
.
Relates to: ES-13061
I built this PR on top of #135548. This review highlights which changes are included in this PR but will be removed once #135548 is merged
super(in); | ||
} | ||
|
||
public NotMasterException(String msg, Object... args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as part of #135548 and will disappear once I rebase
) | ||
); | ||
throw new FailedToCommitClusterStateException( | ||
throw new NotMasterException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed as part of #135548 and will disappear when I rebase
if (e instanceof FailedToCommitClusterStateException) { | ||
failure = new FailedToCommitClusterStateException(e.getMessage(), e); | ||
} else { | ||
failure = new NotMasterException(e.getMessage(), e); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also be handling FailedToPublishClusterStateException
here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Good catch - I realised I'm missing code here, and in a few other places too
As a note to anyone wanting to review, I need to push a second revision, adding the |
…exceptions' of https://github.com/joshua-adams-1/elasticsearch into replace-pre-publication-failed-to-commit-cluster-state-exceptions
…ster-state-exceptions
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
* <p> | ||
* This is a retryable exception inside {@link TransportMasterNodeAction} | ||
*/ | ||
public class FailedToPublishClusterStateException extends ElasticsearchException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meaningfully different from NotMasterException
? I know the name isn't ideal, but introducing a new ElasticsearchException
subclass carries substantial costs too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by costs? I proposed adding a new exception to make the code easier to understand, especially since NotMasterException
implies the error occurs because the node is no longer the master which isn't true in these cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to think about BwC concerns - what happens if you're in a mixed-version cluster and this exception gets thrown to an older node which doesn't know that it's retryable?
Can you point to a case where NotMasterException
doesn't imply that the node has (or will very shortly have) stopped being the master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to think about BwC concerns - what happens if you're in a mixed-version cluster and this exception gets thrown to an older node which doesn't know that it's retryable?
Would something like this protect against mixed version clusters?
if (getVersion().onOrAfter(TransportVersion.THE_VERSION_I_ADDED_ABOVE)) {
throw new FailedToPublishClusterStateException();
} else {
throw new FailedToCommitClusterStateException()
}
Can you point to a case where NotMasterException doesn't imply that the node has (or will very shortly have) stopped being the master?
My proposed solution changed the three FailedToCommitClusterStateExceptions
below into FailedToPublishClusterStateExceptions
:
@Override
public void publish(
ClusterStatePublicationEvent clusterStatePublicationEvent,
ActionListener<Void> publishListener,
AckListener ackListener
) {
try {
synchronized (mutex) {
if (mode != Mode.LEADER || getCurrentTerm() != clusterStatePublicationEvent.getNewState().term()) {
logger.debug(
() -> format(
"[%s] failed publication as node is no longer master for term %s",
clusterStatePublicationEvent.getSummary(),
clusterStatePublicationEvent.getNewState().term()
)
);
// === Changed in #135548 === //
throw new NotMasterException(
"node is no longer master for term "
+ clusterStatePublicationEvent.getNewState().term()
+ " while handling publication"
);
}
if (currentPublication.isPresent()) {
assert false : "[" + currentPublication.get() + "] in progress, cannot start new publication";
logger.error(
() -> format(
"[%s] failed publication as already publication in progress",
clusterStatePublicationEvent.getSummary()
)
);
// === Exception 1 === //
throw new FailedToCommitClusterStateException("publication " + currentPublication.get() + " already in progress");
}
assert assertPreviousStateConsistency(clusterStatePublicationEvent);
final ClusterState clusterState;
final long publicationContextConstructionStartMillis;
final PublicationTransportHandler.PublicationContext publicationContext;
final PublishRequest publishRequest;
try {
clusterState = clusterStatePublicationEvent.getNewState();
assert getLocalNode().equals(clusterState.getNodes().get(getLocalNode().getId()))
: getLocalNode() + " should be in published " + clusterState;
publicationContextConstructionStartMillis = transportService.getThreadPool().rawRelativeTimeInMillis();
publicationContext = publicationHandler.newPublicationContext(clusterStatePublicationEvent);
} catch (Exception e) {
logger.debug(() -> "[" + clusterStatePublicationEvent.getSummary() + "] publishing failed during context creation", e);
becomeCandidate("publication context creation");
// === Exception 2 === //
throw new FailedToCommitClusterStateException("publishing failed during context creation", e);
}
try (Releasable ignored = publicationContext::decRef) {
try {
clusterStatePublicationEvent.setPublicationContextConstructionElapsedMillis(
transportService.getThreadPool().rawRelativeTimeInMillis() - publicationContextConstructionStartMillis
);
publishRequest = coordinationState.get().handleClientValue(clusterState);
} catch (Exception e) {
logger.warn(
"failed to start publication of state version ["
+ clusterState.version()
+ "] in term ["
+ clusterState.term()
+ "] for ["
+ clusterStatePublicationEvent.getSummary()
+ "]",
e
);
becomeCandidate("publication creation");
// === Exception 3 === //
throw new FailedToCommitClusterStateException("publishing failed while starting", e);
}
....
- If a publication is already in progress, AFAIU this implies the current node is not the master, because only master nodes can initiate cluster state updates. But what if this is on the new master running at the same time as the old? Can this occur? Because in this case, a
NotMasterException
would not make sense. - I'm not sure this implies the node will not be the master anymore, since I followed the code through and we can throw an
ElasticsearchException
here if serialization fails, and that's independent of a node being master. - AFAICT this can be safely converted to a
NotMasterException
. Digging into the.handleClientValue(clusterState)
function I see code throwing exceptions such as:
throw new CoordinationStateRejectedException("election not won");
throw new CoordinationStateRejectedException("cannot start publishing next value before accepting previous one");
throw new CoordinationStateRejectedException(
"incoming term " + clusterState.term() + " does not match current term " + getCurrentTerm()
);
...
which all imply the current node is not the master anymore since there are term mismatches, and so a NotMasterException
is correct
What do you mean by
Case 1 has an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (just comment nits but no need for another round)
...rc/main/java/org/elasticsearch/cluster/coordination/FailedToCommitClusterStateException.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/cluster/coordination/FailedToCommitClusterStateException.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah also could you adjust the top-level comment in the PR to look more like a commit message? I generally recommend always having this reflect the eventual commit message, and if you need to give more context to reviewers etc then do so in a reply. Otherwise you'll forget to adjust it on merge sometimes and it'll make a mess of the git log
.
Also maybe (if you think it's not obvious) should we comment on why NotMasterException
is appropriate in these cases because of the preceding becomeCandidate()
?
Co-authored-by: David Turner <[email protected]>
…ster-state-exceptions
…exceptions' of https://github.com/joshua-adams-1/elasticsearch into replace-pre-publication-failed-to-commit-cluster-state-exceptions
Replaces three
FailedToCommitClusterStateExceptions
thrown prior to publishing the cluster state withNotMasterExceptions
Relates to: ES-13061